Data analysis apparatus, data analysis method, and data analysis program

ABSTRACT

A data analysis apparatus executes: a selection process for selecting, from a set of feature variables, a first feature variable group that is a trivial feature variable group contributing to prediction and a second feature variable group other than the first feature variable group; and an operation process for operating, in a loss function related to a difference between a prediction result output in a case of inputting the set of feature variables to a prediction model and ground truth data corresponding to the feature variables, a first regularization coefficient related to a first weight parameter group corresponding to the first feature variable group among a set of weight parameters configuring the prediction model in such a manner that the loss function is larger, and a second regularization coefficient related to a second weight parameter group corresponding to the second feature variable group in such a manner that the loss function is smaller.

CLAIM OF PRIORITY

The present application claims priority from Japanese patent application JP 2019-100316 filed on May 29, 2019, the content of which is hereby incorporated by reference into this application.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a data analysis apparatus, a data analysis method, and a data analysis program for analyzing data.

2. Description of the Related Art

Machine learning is one of the technologies for realizing artificial intelligence (AI). The machine learning technology is configured with a learning process and a prediction process. First, in the learning process, learning parameters are calculated in such a manner that an error between a predicted value obtained from a feature variable vector that is an input and an actual value (true value) is minimum. Subsequently, in the prediction process, a new predicted value is calculated from data not used in learning (hereinafter referred to as test data).

Learning parameter calculation methods and learning parameter computing methods to attain maximum prediction accuracy have been invented so far. With an approach called the perceptron, for example, a predicted value is output on the basis of a computing result of a linear combination between an input feature variable vector and a weight vector. A neural network, also called a multilayer perceptron, is capable of solving linearly inseparable problems by using a plurality of perceptrons in a multi-layered fashion. Deep learning is an approach that introduces new technologies such as dropout to the neural network, and it has stepped into the limelight as an approach capable of achieving high prediction accuracy.

In this way, development of machine learning technologies has been underway with a view to improving prediction accuracy. Apart from the development of the machine learning technologies themselves, there is a known approach of improving prediction accuracy by selecting data for use in learning in advance, as disclosed in International Publication No. WO2010/016110. According to International Publication No. WO2010/016110, feature variables important for prediction are selected by making use of the fact that, in multiple regression analysis, the magnitude of each element value of the weight vector, which is one of the learning parameters, can be used as a degree of importance of each feature variable contributing to prediction.

Machine learning is often used not only for predicting, for example, a probability of contracting a disease or a probability of a machine failure, but also as a technology for identifying the feature variables contributing to such prediction, on condition that a highly accurate prediction result can be obtained.

For example, in analysis of healthcare information, whether a person is a patient is predicted using data about blood tests performed on patients with a disease X and on other people, feature variables contributing to the prediction are extracted as important feature variables, and the important feature variables are made much use of in establishing a treatment policy or a daily life guidance for a patient.

With approaches of prediction by computing a linear combination, such as the method described in International Publication No. WO2010/016110 and the perceptron, the feature variables contributing to the prediction are extracted by identifying the important feature variables using the magnitude of each element value of the weight vector. Furthermore, with approaches of prediction by computing a non-linear combination, feature variables contributing to prediction are extracted by identifying important feature variables using an out-of-bag error rate in random forests, which is one of the approaches using a decision tree.

As described in Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin, "Why should I trust you?: Explaining the predictions of any classifier," Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2016, development of approaches capable of extracting important feature variables is also underway in deep learning, which is capable of solving linearly inseparable problems and the like. The development of these approaches has helped establish new effective treatment policies and daily life guidances.

For example, in a case in which specific feature variables are nearly equivalent to a true value, it is possible to make a highly accurate prediction using only those specific feature variables. In addition, in a case, for example, in which feature variables other than the specific feature variables also contribute to prediction of the true value, the degrees of importance of those other feature variables are relatively reduced, and it may become impossible to extract them as feature variables contributing to the prediction. The specific feature variables, in particular, are assumed to be trivial feature variables already estimated to relate to the disease X by the analyses and the like performed so far.

Furthermore, to make it clear that the feature variables (hereinafter, "non-trivial feature variables") other than the trivial feature variables contributing to prediction (hereinafter, simply "trivial feature variables") also contribute to prediction, it is necessary to perform prediction using only the non-trivial feature variables. In this case, a reduction in prediction accuracy is conceivable because the trivial feature variables are not used.

SUMMARY OF THE INVENTION

An object of the present invention is to provide a data analysis apparatus, a data analysis method, and a data analysis program capable of extracting non-trivial feature variables contributing to prediction as important feature variables.

A data analysis apparatus according to one aspect of the present invention is a data analysis apparatus including: a processor that executes a program; and a storage device that stores the program, the processor executing: a selection process for selecting a first feature variable group that is a trivial feature variable group contributing to prediction and a second feature variable group other than the first feature variable group from a set of feature variables; an operation process for operating a first regularization coefficient related to a first weight parameter group corresponding to the first feature variable group among a set of weight parameters configuring a prediction model in such a manner that the loss function is larger, and operating a second regularization coefficient related to a second weight parameter group corresponding to the second feature variable group among the set of weight parameters configuring the prediction model in such a manner that the loss function is smaller, in a loss function related to a difference between a prediction result output in a case of inputting the set of feature variables to the prediction model and ground truth data corresponding to the feature variables; and a learning process for learning the set of weight parameters of the prediction model in such a manner that the loss function is minimum as a result of operating the first regularization coefficient and the second regularization coefficient by the operation process.

According to a representative embodiment of the present invention, it is possible to extract non-trivial feature variables contributing to prediction as important feature variables. Objects, configurations, and advantages other than those described above will be readily apparent from the description of the embodiments given below.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is an explanatory diagram 1 depicting a trivial feature variable and non-trivial feature variables;

FIG. 2 is an explanatory diagram 2 depicting a trivial feature variable and non-trivial feature variables;

FIG. 3 is a block diagram depicting an example of a hardware configuration of a data analysis apparatus according to a first embodiment;

FIG. 4 is a block diagram depicting an example of a functional configuration of the data analysis apparatus according to the first embodiment;

FIG. 5 is a flowchart depicting an example of a data analysis process procedure by the data analysis apparatus according to the first embodiment;

FIG. 6 is an explanatory diagram depicting a display example 1 of a display screen;

FIG. 7 is an explanatory diagram depicting a display example 2 of the display screen;

FIG. 8 is an explanatory diagram depicting a feature variable vector Features and ground truth data Target;

FIG. 9 is an explanatory diagram depicting an experimental result;

FIG. 10 depicts an example of screen display of a data analysis apparatus according to a fourth embodiment;

FIGS. 11A and 11B are explanatory diagrams depicting an example of reallocation of feature variable vectors;

FIG. 12 is an explanatory diagram depicting an example of a structure of a neural network according to a fifth embodiment; and

FIG. 13 is a block diagram depicting an example of a functional configuration of a data analysis apparatus according to the fifth embodiment.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

A data analysis apparatus according to a first embodiment will first be described. The data analysis apparatus according to the first embodiment selects trivial feature variables and non-trivial feature variables, and displays a prediction result by increasing degrees of contribution of the non-trivial feature variables to prediction and suppressing degrees of contribution of the trivial feature variables in a learning process.

<Trivial Feature Variables and Non-Trivial Feature Variables>

FIG. 1 is an explanatory diagram 1 depicting a trivial feature variable and non-trivial feature variables. FIG. 1 depicts an example of logistic regression for predicting whether a senior high school student can graduate from senior high school. In FIG. 1, a feature variable x_(1,n) indicates an age, a feature variable x_(2,n) indicates the number of attendances, a feature variable x_(3,n) indicates a body height, and a predicted value y_(n) indicates whether a senior high school student can graduate from senior high school. n indicates an n-th (where n is an integer equal to or greater than 1) senior high school student. It is assumed that among the feature variables x_(1,n) to x_(3,n), the feature variable x_(1,n) is a specific feature variable nearly equivalent to a true value of the predicted value y_(n).

(A) indicates logistic regression for predicting the predicted value y_(n) using the feature variables x_(1,n) to x_(3,n). σ indicates a sigmoid function, w₁ to w₃ indicate degrees of contribution (also referred to as weight parameters) to prediction of the predicted value y_(n), and an area under curve (AUC) indicates prediction accuracy (0.00≤AUC≤1.00). A higher value of the AUC indicates higher prediction accuracy. Since the specific feature variable x_(1,n) is nearly equivalent to the true value of the predicted value y_(n) indicating whether the senior high school student can graduate from senior high school, the specific feature variable x_(1,n) is regarded as a trivial feature variable.

The AUC is an abbreviation of the area under an ROC curve, which is the area of the part surrounded by the horizontal axis and the vertical axis of a receiver operating characteristic curve (ROC curve); an AUC closer to 1 means that the accuracy of a model is higher. The ROC curve is plotted with the horizontal axis indicating a false-positive rate and the vertical axis indicating a true-positive rate. In other words, an AUC closer to 1 means achieving a high true-positive rate at a point at which the value of the false-positive rate is low; thus, it is possible to evaluate that the model has high accuracy with a smaller bias. It is noted herein that the false-positive rate is a rate obtained by dividing the number of false-positive samples by the sum of the number of false-positive samples and the number of true-negative samples, and the true-positive rate is a rate obtained by dividing the number of true-positive samples by the sum of the number of true-positive samples and the number of false-negative samples.
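For illustration only (this sketch is not part of the embodiment), the two rates defined above can be computed from binary predictions and correct labels as follows; sweeping a decision threshold over the predicted values y_(n) and plotting the resulting (false-positive rate, true-positive rate) pairs traces the ROC curve, whose area is the AUC. The function name fpr_tpr is an assumption for this sketch.

```python
import numpy as np

def fpr_tpr(predicted: np.ndarray, actual: np.ndarray) -> tuple:
    """Rates as defined above; `predicted` and `actual` are 0/1 arrays,
    where 1 means positive (e.g., having a disorder)."""
    tp = np.sum((predicted == 1) & (actual == 1))  # true-positive samples
    fp = np.sum((predicted == 1) & (actual == 0))  # false-positive samples
    tn = np.sum((predicted == 0) & (actual == 0))  # true-negative samples
    fn = np.sum((predicted == 0) & (actual == 1))  # false-negative samples
    fpr = fp / (fp + tn)  # false-positive rate
    tpr = tp / (tp + fn)  # true-positive rate
    return fpr, tpr
```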

In the first embodiment, in a case, for example, in which the predicted value y_(n) is a test result (positive) and the correct label t_(n) is having a disorder, the samples (feature variables x_(n)) are true-positive. Furthermore, in a case in which the predicted value y_(n) is the test result (positive) and the correct label t_(n) is not having a disorder, the samples (feature variables x_(n)) are false-positive. Moreover, in a case in which the predicted value y_(n) is a test result (negative) and the correct label t_(n) is having a disorder, the samples (feature variables x_(n)) are false-negative. Furthermore, in a case in which the predicted value y_(n) is the test result (negative) and the correct label t_(n) is not having a disorder, the samples (feature variables x_(n)) are true-negative.

In a case in which the degree of contribution w₁ of the feature variable x_(1,n) to prediction of the predicted value y_(n) is high, the degrees of contribution of the other feature variables, that is, the degree of contribution w₂ of the feature variable x_(2,n) and the degree of contribution w₃ of the feature variable x_(3,n), are relatively low. Owing to this, it is impossible to extract the other feature variables as feature variables contributing to prediction although the other feature variables actually include feature variables contributing to the prediction.

(B) indicates logistic regression for predicting the predicted value y_(n) while excluding the trivial feature variable x_(1,n) from among the feature variables x_(1,n) to x_(3,n). In this case, excluding the trivial feature variable x_(1,n) causes an increase in the value of the degree of contribution w₂ of the feature variable x_(2,n) (w₂=0.95) in (B), whereas the value of the degree of contribution w₂ of the feature variable x_(2,n) is low (w₂=0.15) in (A). In this way, the feature variable x_(2,n), although being a non-trivial feature variable, is regarded as a feature variable that also contributes to the prediction.

Therefore, the data analysis apparatus according to the present embodiment operates the parameters of a loss function in such a manner that the weight of the trivial feature variable nearly equivalent to the true value of the predicted value y_(n) is suppressed, operates the parameters of the loss function in such a manner that the weight of the non-trivial feature variables is increased, and maintains the prediction accuracy AUC for the predicted value y_(n) without reducing it.

FIG. 2 is an explanatory diagram 2 depicting a trivial feature variable and non-trivial feature variables. FIG. 2 depicts an example of logistic regression for predicting whether a college student can graduate from college. In FIG. 2, the feature variable x_(1,n) indicates the number of attendances, the feature variable x_(2,n) indicates a test score, the feature variable x_(3,n) indicates a body height, and the predicted value y_(n) indicates whether a college student can graduate from college. n indicates an n-th (where n is an integer equal to or greater than 1) college student. It is assumed that among the feature variables x_(1,n) to x_(3,n), the feature variable x_(1,n) is a specific feature variable which is known to contribute to prediction of the predicted value y_(n) despite low equivalence to the true value.

(A) indicates logistic regression for predicting the predicted value y_(n) using the feature variables x_(1,n) to x_(3,n). Since the specific feature variable x_(1,n) is the number of attendances, it is assumed that a college student with a large number of attendances is a serious student and is evaluated as an excellent student. Since the specific feature variable x_(1,n) is known to contribute to prediction of the predicted value y_(n), the specific feature variable x_(1,n) is regarded as a trivial feature variable.

In a case in which the degree of contribution w₁ of the feature variable x_(1,n) to prediction of the predicted value y_(n) is considerably high, the degrees of contribution of the other feature variables, that is, the degree of contribution w₂ of the feature variable x_(2,n) and the degree of contribution w₃ of the feature variable x_(3,n), are relatively low. Owing to this, it is impossible to extract the other feature variables as feature variables contributing to prediction although the other feature variables actually include feature variables contributing to the prediction.

(B) indicates logistic regression for predicting the predicted value y_(n) while excluding the trivial feature variable x_(1,n) from among the feature variables x_(1,n) to x_(3,n). In this case, the machine learning enables an increase in the value of the degree of contribution w₂ of the feature variable x_(2,n) (w₂=0.95) in (B), whereas the value of the degree of contribution w₂ of the feature variable x_(2,n) is low (w₂=0.35) in (A). In this way, the feature variable x_(2,n) is regarded as a non-trivial feature variable contributing to prediction.

Therefore, the data analysis apparatus according to the present embodiment operates the parameters of a loss function in such a manner that the weight of the trivial feature variable known to contribute to prediction of the predicted value y_(n) is reduced, operates the parameters of the loss function in such a manner that the weight of the non-trivial feature variables is increased, and maintains the prediction accuracy AUC for the predicted value y_(n) without reducing it.

<Example of Hardware Configuration of Data Analysis Apparatus>

FIG. 3 is a block diagram depicting an example of a hardware configuration of the data analysis apparatus according to the first embodiment. A data analysis apparatus 300 has a processor 301, a storage device 302, an input device 303, an output device 304, and a communication interface (communication IF) 305. The processor 301, the storage device 302, the input device 303, the output device 304, and the communication IF 305 are connected to one another by a bus 306. The processor 301 controls the data analysis apparatus 300. The storage device 302 serves as a work area of the processor 301. Furthermore, the storage device 302 is a non-transitory or transitory recording medium that stores various programs and data. Examples of the storage device 302 include a read only memory (ROM), a random access memory (RAM), a hard disk drive (HDD), and a flash memory. Data is input through the input device 303. Examples of the input device 303 include a keyboard, a mouse, a touch panel, a numeric keypad, and a scanner. The output device 304 outputs data. Examples of the output device 304 include a display and a printer. The communication IF 305 is connected to a network to transmit and receive data.

<Example of Functional Configuration of Data Analysis Apparatus 300>

FIG. 4 is a block diagram depicting an example of a functional configuration of the data analysis apparatus 300 according to the first embodiment. The data analysis apparatus 300 has a data storage section 401, a model storage section 402, a result storage section 403, a selection section 411, a learning section 412, an operation section 413, a prediction section 414, a degree-of-importance calculation section 415, and an output section 416. The data storage section 401, the model storage section 402, and the result storage section 403 are specifically realized by, for example, the storage device 302 depicted in FIG. 3. Furthermore, the selection section 411, the learning section 412, the operation section 413, the prediction section 414, the degree-of-importance calculation section 415, and the output section 416 are specifically realized by, for example, causing the processor 301 to execute a program stored in the storage device 302 depicted in FIG. 3.

The data storage section 401 stores training data for use in a learning process by the learning section 412 and test data for use in a prediction process by the prediction section 414.

The training data is sample data configured with, for example, each combination {x_(d,n), t_(n)} of feature variables x_(d,n) and a correct label t_(n) that is a true value thereof (where d=1, 2, . . . , D and n=1, 2, . . . , N; D is the number of types (dimensions) of feature variables and N is the number of samples). The feature variables x_(d,n) are, for example, test data or image data about each patient.

The test data is feature variables x_(d,n) different from the training data. A combination of feature variables x_(d,n) as the test data from which the predicted value y_(n) has been obtained and the correct label t_(n) that is the true value thereof is handled as the training data.

The model storage section 402 stores output data from the learning section 412. The output data contains a weight vector w_(n), which indicates the degrees of contribution of the feature variables x_(d,n).

The result storage section 403 stores the predicted value y_(n) calculated by a prediction process by the prediction section 414, the weight parameters w_(n) that are learning parameters, and the important feature variables contributing to prediction and extracted by the degree-of-importance calculation section 415.

The selection section 411 selects trivial feature variables and non-trivial feature variables from a set of feature variables x_(d,n) that are the training data. The selection section 411 may select, as the trivial feature variables, feature variables suggested to be academically important by the accumulated findings of developers or engineers, by documents, or the like.

In addition, the selection section 411 selects, as the non-trivial feature variables, the remaining feature variables x_(d,n) that are not selected as the trivial feature variables from among the set of feature variables x_(d,n). In FIGS. 1 and 2, for example, the selection section 411 selects the feature variable x_(1,n) as the trivial feature variable and selects the feature variables x_(2,n) and x_(3,n) as the non-trivial feature variables.

The learning section 412 updates the hyperparameters and the weight parameters w_(n) in the following Equation (1) in such a manner that the error between the predicted value y_(n) obtained from the feature variables x_(d,n) that are inputs and the correct label t_(n) is minimum.

[Expression 1]

$$y_n = \sigma\left(w_1 x_{1,n} + w_2 x_{2,n} + \cdots + w_D x_{D,n}\right) = \sigma\left(w^t x_n\right) \quad (1)$$

Equation (1) above is an example of a prediction expression of logistic regression, which is one approach of machine learning using computing of a linear combination in calculation of the predicted value y_(n). The predicted value y_(n) is calculated on the basis of the feature variables x_(d,n) and the weight vector w∈R^(D) (where D is an integer equal to or greater than 1). w^(t) is the transpose of the weight vector w having the weight parameters w₁ to w_(D) as elements. σ denotes an activation function such as the sigmoid function. x_(n) is a feature variable vector having the feature variables x_(d,n) as elements.
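A minimal sketch of the prediction expression of Equation (1), assuming NumPy; the names sigmoid and predict are illustrative and do not appear in the embodiment.

```python
import numpy as np

def sigmoid(z):
    """Activation function sigma used in Equation (1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(w: np.ndarray, x_n: np.ndarray) -> float:
    """Predicted value y_n = sigma(w^t x_n) for one feature variable vector x_n."""
    return float(sigmoid(w @ x_n))
```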

The learning section 412 sets a loss function L(w_(n)) for calculating the learning parameters (weight vector w_(n)) using Equation (1) above in such a manner that the error between the predicted value y_(n) obtained from the feature variable vector x_(n) that is the input and the correct label t_(n) that is the actual value (true value) is minimum. Specifically, the learning section 412 sets, for example, the weight parameters w_(k,n) of the trivial feature variables x_(k,n) selected by the selection section 411 and the weight parameters w_(h,n) of the non-trivial feature variables x_(h,n) selected by the selection section 411 in a degree-of-contribution operation term R_(P)(w^(t)_(n)).

The loss function L(w_(n)) is represented by a sum of an error function E(w_(n)) and a degree-of-contribution operation term R_(P)(w_(n)), as depicted in the following Equations (2) and (3).

[Expression 2]

$$L(w_n) = E(w_n) + R_p(w_n) \quad (2)$$

$$R_p(w_n) = \lambda \sum_{j=1}^{N} \left\| w_j \right\|_p \quad (3)$$

w_(n) is a weight vector having, as elements, the weight parameters w₁ to w_(D) corresponding to the feature variables x_(1,n) to x_(D,n) of the feature variable vector x_(n) that is the n-th sample. The error function E(w_(n)) may be, for example, a mean squared error or a cross entropy error between the predicted value y_(n) and the correct label t_(n).

Furthermore, Equation (3) is the degree-of-contribution operation term R_(P)(w_(n)). The hyperparameters in the degree-of-contribution operation term R_(P)(w_(n)) are set by the operation section 413. In Equation (3), λ (0.0≤λ≤1.0) is a loss coefficient. As λ is larger, the value of the loss function L(w_(n)) becomes higher. p denotes a norm dimension.
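The following sketch shows one plausible reading of Equations (2) and (3), assuming the cross entropy error function mentioned above and interpreting the norm term as λ multiplied by the sum of |w_j|^p; the function names and default values are assumptions for illustration.

```python
import numpy as np

def cross_entropy(y: np.ndarray, t: np.ndarray) -> float:
    """One possible error function E: cross entropy between predictions y
    and correct labels t."""
    eps = 1e-12  # guards against log(0)
    return float(-np.mean(t * np.log(y + eps) + (1 - t) * np.log(1 - y + eps)))

def loss(w: np.ndarray, y: np.ndarray, t: np.ndarray,
         lam: float = 0.1, p: int = 2) -> float:
    """L(w) = E(w) + R_p(w), reading R_p as lambda times the sum of |w_j|**p."""
    r_p = lam * np.sum(np.abs(w) ** p)
    return cross_entropy(y, t) + float(r_p)
```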

Moreover, a prediction expression in a case in which a weight vector w_(n) is present for each feature variable vector x_(n) is expressed by, for example, the following Equation (4), depending on the machine learning approach.

[Expression 3]

$$y_n = \sigma\left(w_{1,n} x_{1,n} + w_{2,n} x_{2,n} + \cdots + w_{D,n} x_{D,n}\right) = \sigma\left(w_n^t x_n\right) \quad (4)$$

Furthermore, a loss function L(w^(t)_(n)) is represented by a sum of an error function E(w^(t)_(n)) and a degree-of-contribution operation term R_(P)(w^(t)_(n)), as depicted in the following Equations (5) and (6).

[Expression 4]

$$L(w_n^t) = E(w_n^t) + R_p(w_n^t) \quad (5)$$

$$R_p(w_n^t) = \frac{\lambda}{N} \sum_{j=1}^{N} \left\| w_j^t \right\|_p \quad (6)$$

Furthermore, the degree-of-contribution operation term R_(P)(w^(t)_(n)) of Equation (6) may be replaced by a degree-of-contribution operation term R₁(w_(n)) of the following Equation (7) with the norm dimension p=1.

[Expression 5]

$$R_1(w_n) = \frac{\lambda}{N} \sum_{n=1}^{N} \left( \mu \sum_{k \in T} \left| w_{k,n} \right| + \nu \sum_{h \in U} \left| w_{h,n} \right| \right) \quad (7)$$

In the degree-of-contribution operation term R₁(w_(n)) of Equation (7), λ is the loss coefficient described above, μ is a first regularization coefficient related to the weight parameters w_(k,n) of the trivial feature variables x_(k,n), and ν is a second regularization coefficient related to the weight parameters w_(h,n) of the non-trivial feature variables x_(h,n). A relationship between the first regularization coefficient μ and the second regularization coefficient ν is, for example, μ+ν=1.0. λ, μ, and ν are hyperparameters. Furthermore, k indicates a number representing the trivial feature variables x_(k,n), T indicates the set of the trivial feature variables x_(k,n), h indicates a number representing the non-trivial feature variables, and U indicates the set of the non-trivial feature variables x_(h,n).

Adding the degree-of-contribution operation term R₁(w_(n)) to the error function E(w^(t)_(n)) by the learning section 412 makes it possible to produce the effects of preventing growth of the weight parameters w_(k,n) of the trivial feature variables x_(k,n) and of obtaining a sparse model.
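As a concrete illustration, Equation (7) might be rendered for a single weight vector as in the following sketch (the per-sample sum and the 1/N factor are omitted for brevity); trivial_idx stands in for the set T, and the defaults μ=0.8 and ν=0.2 follow the conditions μ+ν=1.0 and μ>ν operated by the operation section 413 described later.

```python
import numpy as np

def r1(w: np.ndarray, trivial_idx, mu: float = 0.8, nu: float = 0.2,
       lam: float = 0.1) -> float:
    """Group-wise L1 term in the spirit of Equation (7) for one weight
    vector: mu penalizes the trivial features (set T), nu the non-trivial
    ones (set U)."""
    mask = np.zeros(w.shape, dtype=bool)
    mask[list(trivial_idx)] = True  # set T of trivial feature variables
    return float(lam * (mu * np.sum(np.abs(w[mask]))
                        + nu * np.sum(np.abs(w[~mask]))))
```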

Moreover, the degree-of-contribution operation term R_(P)(w^(t)_(n)) of Equation (6) may be replaced by a degree-of-contribution operation term R₂(w_(n)) of the following Equation (8) with the norm dimension p=2.

[Expression 6]

$$R_2(w_n) = \frac{\lambda}{N} \sum_{n=1}^{N} \left( \mu \sum_{k \in T} w_{k,n}^2 + \nu \sum_{h \in U} w_{h,n}^2 \right) \quad (8)$$

Adding the degree-of-contribution operation term R₂(w_(n)) to the error function E(w_(n)) by the learning section 412 makes it possible to produce the effects of preventing growth of the weight parameters w_(k,n) of the trivial feature variables x_(k,n) and of suppressing overfitting to obtain a smooth prediction model.

Furthermore, the degree-of-contribution operation term R_(P)(w^(t)_(n)) may be replaced by a degree-of-contribution operation term R_(els)(w_(n)) of the following Equation (9).

[Expression 7]

$$R_{els}(w_n) = \frac{\lambda}{N} \sum_{n=1}^{N} \left( \mu \sum_{k \in T} \left\{ \alpha \left| w_{k,n} \right| + (1-\alpha) w_{k,n}^2 \right\} + \nu \sum_{h \in U} \left\{ \alpha \left| w_{h,n} \right| + (1-\alpha) w_{h,n}^2 \right\} \right) \quad (9)$$

Equation (9) is an equation of an elastic net of linear combination between the L1 norm and the L2 norm of each weight vector w_(n), and is a degree-of-contribution operation term obtained by the linear combination between Equations (7) and (8). In Equation (9), α (0.0≤α≤1.0) is an elastic coefficient. The elastic coefficient α is also a hyperparameter.

Adding the degree-of-contribution operation term R_(els)(w_(n)) to the error function E(w_(n)) makes it possible to produce an effect of preventing the growth of the weight parameters w_(k,n) of the trivial feature variables x_(k,n) to obtain a sparse model as depicted in Equation (7), and an effect of suppressing overfitting to obtain a smooth model as depicted in Equation (8).
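A corresponding sketch of the elastic net term of Equation (9), again for a single weight vector with the per-sample sum and 1/N factor omitted; the names and default values are assumptions for illustration.

```python
import numpy as np

def r_els(w: np.ndarray, trivial_idx, mu: float = 0.8, nu: float = 0.2,
          lam: float = 0.1, alpha: float = 0.5) -> float:
    """Elastic-net style term in the spirit of Equation (9): per group,
    a linear combination via alpha of the L1 part (sparsity) and the
    L2 part (smoothness)."""
    mask = np.zeros(w.shape, dtype=bool)
    mask[list(trivial_idx)] = True  # set T of trivial feature variables

    def elastic(v: np.ndarray) -> float:
        return float(alpha * np.sum(np.abs(v)) + (1 - alpha) * np.sum(v ** 2))

    return lam * (mu * elastic(w[mask]) + nu * elastic(w[~mask]))
```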

The operation section 413 operates the hyperparameters in the degree-of-contribution operation term for increasing the degrees of contribution of the non-trivial feature variables to prediction and reducing the degrees of contribution of the trivial feature variables to prediction. The operation section 413 operates the hyperparameters described above, that is, the loss coefficient λ, the first regularization coefficient μ related to the weight parameters w_(k,n) of the trivial feature variables x_(k,n), the second regularization coefficient ν related to the weight parameters w_(h,n) of the non-trivial feature variables x_(h,n), and the elastic coefficient α. Since the first regularization coefficient μ and the second regularization coefficient ν are each set to a value from 0.0 to 1.0, it is possible to facilitate control over the degree of suppression of the weight parameters w_(k,n) of the trivial feature variables x_(k,n).

Furthermore, the operation section 413 operates the first regularization coefficient μ and the second regularization coefficient ν in such a manner that a sum of the first regularization coefficient μ and the second regularization coefficient ν is, for example, 1.0. The operation section 413 operates the first regularization coefficient μ and the second regularization coefficient ν in such a manner that the first regularization coefficient μ is greater than the second regularization coefficient ν. The operation section 413 may operate the hyperparameters on condition that the first regularization coefficient μ related to the weight parameters w_(k,n) of the trivial feature variables x_(k,n) is greater than 0.5.

On this condition, if the first regularization coefficient μ is set greater than the second regularization coefficient ν, the value of the term multiplied by the first regularization coefficient μ becomes greater within the degree-of-contribution operation term R_(P)(w^(t)_(n)). Owing to this, learning is carried out in such a manner that the values of the first weight parameter group, that is, the weight parameters w_(k,n) corresponding to the trivial feature variables x_(k,n), are made smaller than the values of the second weight parameter group, that is, the weight parameters w_(h,n) corresponding to the non-trivial feature variables x_(h,n), so as to make the loss function L(w^(t)_(n)) smaller. As a result, it is possible to suppress the weight parameters w_(k,n) of the trivial feature variables x_(k,n), compared with a case of not using the degree-of-contribution operation term R_(P)(w^(t)_(n)). Furthermore, a range of the value of the first regularization coefficient μ may be set, for example, to be equal to or greater than 0.7.

Moreover, while the L1 norm, the L2 norm, and the elastic net have been illustrated as examples, the norm dimension p may also be set to 0.5 or the like.

The prediction section 414 executes a prediction process by giving the feature variables x_(d,n) as the test data to a prediction model in which the weight vector w_(n) is applied to Equations (1) and (2), and outputs the predicted value y_(n) to the result storage section 403 and the output section 416.

In addition, the prediction section 414 calculates the AUC for the predicted value y_(n) described above. A case in which the AUC is equal to or smaller than a threshold means a failure in prediction. In this case, the operation section 413 may re-operate each hyperparameter and the learning section 412 may perform relearning of the weight vector w_(n).

The degree-of-importance calculation section 415 aligns the feature variables x_(d,n) in descending order of contribution to prediction using the weight vector w_(n) stored in the model storage section 402, and carries out calculation to regard the feature variables x_(d,n) as important feature variables in the descending order of contribution to prediction. The descending order of contribution to prediction is, for example, the descending order of the norms of the weight vector w_(n). The degree-of-importance calculation section 415 calculates the norms of the weight vector w_(n).

The degree-of-importance calculation section 415 assigns degrees of importance to the feature variables x_(d,n) in the descending order of contribution to prediction. The degree of importance is proportional to the norm and takes on a greater value as the norm is higher. The degree-of-importance calculation section 415 may apply a weight (a value equal to or greater than 0.0 and smaller than 1.0) to the norms of the weight vector w_(n) of the trivial feature variables. Furthermore, the degree-of-importance calculation section 415 may exclude the trivial feature variables at the time of aligning the feature variables x_(d,n) in the descending order of contribution to prediction.
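A minimal sketch of this ranking for a single weight vector, assuming the norm of each weight is taken as |w|; the names rank_features and trivial_weight are illustrative stand-ins.

```python
import numpy as np

def rank_features(w: np.ndarray, names: list, trivial_idx=(),
                  trivial_weight: float = 0.5) -> list:
    """Order features by |w| as the degree of importance; optionally
    down-weight trivial features by a factor in [0.0, 1.0) as described."""
    importance = np.abs(w).astype(float)
    for k in trivial_idx:
        importance[k] *= trivial_weight
    order = np.argsort(importance)[::-1]  # descending order of importance
    return [(names[i], float(importance[i])) for i in order]
```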

Moreover, the degree-of-importance calculation section 415 may assign the norms themselves as the degrees of importance. To calculate the degrees of importance, an out-of-bag error rate may be used without using the weight vector w_(n), depending on the machine learning approach to be used.

The selection section 411 may thereby further select the trivial feature variables and the non-trivial feature variables while referring to the degrees of importance calculated by the degree-of-importance calculation section 415.

It is noted that a plurality of apparatuses may configure the data analysis apparatus 300. For example, a plurality of data analysis apparatuses 300 may be present for load distribution. Alternatively, a plurality of apparatuses including one or more of the functions may configure the data analysis apparatus 300.

<Example of Data Analysis Process Procedure>

FIG. 5 is a flowchart depicting an example of a data analysis process procedure by the data analysis apparatus 300 according to the first embodiment. The data analysis apparatus 300 reads, by the selection section 411, a training data set from the data storage section 401 (Step S501), and then selects, by the selection section 411, the trivial feature variables and the non-trivial feature variables from among the training data set (Step S502).

The data analysis apparatus 300 then generates the weight parameters w_(n) using the loss function L of Equation (2) or (5) in such a manner that the error between the predicted value y_(n), obtained by giving the feature variables x_(d,n) of the training data set, and the correct label t_(n) is minimum (Step S503). Steps S501 to S503 correspond to the learning process.

The data analysis apparatus 300 reads, by the prediction section 414, a test data set from the data storage section 401 (Step S504). The data analysis apparatus 300 calculates the predicted value y_(n) by giving the feature variables x_(d,n) of the test data set to the prediction model in which the weight parameters w_(n) are set in the prediction expression of Equation (1) or (4) (Step S505).

The data analysis apparatus 300 extracts, by the degree-of-importance calculation section 415, the degrees of importance of the feature variables (Step S506). Next, the data analysis apparatus 300 saves a combination of the predicted value y_(n) and the degrees of importance in the result storage section 403 (Step S507). The data analysis apparatus 300 then outputs, by the output section 416, the combination of the predicted value y_(n) and the degrees of importance (Step S508).

Further, the data analysis apparatus 300 operates, by the operation section 413, the hyperparameters, that is, the loss coefficient λ, the first regularization coefficient μ related to the weight parameters w_(k,n) of the trivial feature variables x_(k,n), the second regularization coefficient ν related to the weight parameters w_(h,n) of the non-trivial feature variables x_(h,n), and the elastic coefficient α (Step S509).
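For illustration, the following self-contained sketch ties Steps S501 to S503 and S509 together: logistic regression is trained by subgradient descent on a loss combining the cross entropy error with the group-wise L1 term of Equation (7). The synthetic data, learning rate, and coefficient values are assumptions, not values from the embodiment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption, not from the embodiment):
# feature 0 nearly determines the label, mimicking a trivial feature variable.
N, D = 200, 3
X = rng.normal(size=(N, D))
t = (X[:, 0] + 0.3 * X[:, 1] + 0.1 * rng.normal(size=N) > 0).astype(float)

trivial = [0]                         # Step S502: designate feature 0 as trivial
lam, mu, nu, lr = 0.1, 0.9, 0.1, 0.3  # hyperparameters operated in Step S509
g = np.where(np.isin(np.arange(D), trivial), mu, nu)  # per-feature coefficient

w = np.zeros(D)
for _ in range(2000):                 # Step S503: minimize L = E + R1
    y = 1.0 / (1.0 + np.exp(-X @ w))  # Equation (1): y = sigma(w^t x)
    grad_e = X.T @ (y - t) / N        # gradient of the cross entropy error E
    grad_r = lam * g * np.sign(w)     # subgradient of the group-wise L1 term
    w -= lr * (grad_e + grad_r)

# The weight of the trivial feature is suppressed relative to training
# without the group-wise term (i.e., mu = nu), while accuracy is retained.
print("learned weights:", w)
```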

<Examples of Display Screen>

FIG. 6 is an explanatory diagram depicting a display example 1 of a display screen. A display screen 600 is displayed on a display that is an example of the output device 304 in the data analysis apparatus 300, or on a display of a computer to which the output section 416 outputs data.

The display screen 600 includes an import file button 601, a feature select button 602, a train button 603, a predict button 604, a save button 605, a file name box 606, and a select screen 610.

Upon detecting depression of the import file button 601 by a user's operation, the data analysis apparatus 300 selects the training data for use by the learning section 412, the test data for use by the prediction section 414, a determined optimum model, a prediction result, the degrees of importance, and the like. Names of the selected data are displayed in the file name box 606. Subsequently, upon depression of the feature select button 602 by a user's operation, the data analysis apparatus 300 displays, by the selection section 411, the select screen 610 of the feature variables.

A user places a checkmark, for example in a checkbox 611, for each feature variable to be selected as a trivial feature variable. The selection section 411 selects the feature variables in the checked checkboxes as the trivial feature variables. When the selection of the feature variables is over and learning is to be started, the user depresses the train button 603. The learning section 412 thereby starts the learning process. Subsequently, the user selects test data and depresses the predict button 604. The prediction section 414 thereby starts the prediction process.

FIG. 7 is an explanatory diagram depicting a display example 2 of the display screen. On the display screen 600, the predicted value y_(n), the degrees of importance, and a suppression effect on the weight parameters w_(k,n) of each trivial feature variable x_(k,n) are displayed after the end of the prediction process. The predicted value y_(n) is displayed in an accuracy display area 711. In addition, the weight parameter w_(d,n) of each feature variable x_(d,n) in ordinary prediction and a result of suppressing the weight parameter w_(k,n) of each trivial feature variable x_(k,n) by the operation section 413 are displayed side by side in a suppression effect display area 712.

While a comparison between the ordinary prediction and the suppression result is displayed in FIG. 7, only the suppression result may be displayed. Furthermore, the value displayed as the weight parameter w_(k,n) of each trivial feature variable x_(k,n) may be the value of the actual weight parameter w_(k,n), a value normalized in each sample n, or an average value obtained by normalizing the value in each sample n and then adding up the normalized values over all samples 1 to N, or it may be obtained by a total cross-validation.

In a case of saving these analysis results, the user depresses the save button 605. A screen on which a memory space in which the analysis results are to be saved can be designated is thereby displayed. Upon the user's designating the memory space and depressing an execution button, the analysis results are saved in the designated memory space. The memory space in which the analysis results are saved is displayed in an export file name box 701 or the like.

In this way, according to the first embodiment, using the loss function that sets different penalties between the trivial feature variables and the non-trivial feature variables in machine learning accountable for the grounds of prediction makes it possible to realize prediction while suppressing the degrees of contribution (weight parameters w_(k,n)) of the trivial feature variables x_(k,n) to prediction and making active use of the other non-trivial feature variables x_(h,n). This makes it possible to extract unknown feature variables that contribute to prediction but may not have been discovered yet in academic findings or the like.

Second Embodiment

A second embodiment will be described. In the first embodiment, two feature variable groups, that is, a trivial feature variable group and a non-trivial feature variable group, are selected. The second embodiment is an example of increasing the number of feature variable groups for which the degrees of contribution are operated, compared with the first embodiment, such as trivial feature variables, non-trivial feature variables, and trivial feature variables not contributing to prediction. It is noted that the same constituent elements as those in the first embodiment are denoted by the same reference characters and are not often described.

The selection section 411 selects, from the set of feature variables x_(d,n) as the training data, trivial feature variables in which contributing to prediction is trivial and feature variables in which contributing to prediction is non-trivial, as well as trivial feature variables in which not contributing to prediction is trivial. The selection section 411 may select, as the trivial feature variables, feature variables suggested to be academically important by the accumulated findings of developers or engineers, by documents, or the like.

In addition, the selection section 411 may select, as the trivial feature variables not contributing to prediction, feature variables suggested to be not academically important by the accumulated findings of developers or engineers, by documents, or the like. Furthermore, the selection section 411 may select, as the non-trivial feature variables, the remaining feature variables x_(d,n) that are selected neither as the trivial feature variables nor as the trivial feature variables not contributing to prediction from among the set of feature variables x_(d,n). In FIGS. 1 and 2, for example, the selection section 411 selects the feature variable x_(1,n) as the trivial feature variable, selects the feature variable x_(2,n) as the non-trivial feature variable, and selects the feature variable x_(3,n) as the trivial feature variable not contributing to prediction.

The operation section 413 operates the hyperparameters in the degree-of-contribution operation term for increasing the degrees of contribution of the non-trivial feature variables to prediction and reducing the degrees of contribution of the trivial feature variables to prediction and the degrees of contribution of the trivial feature variables not contributing to prediction. The degree-of-contribution operation term R_(P)(w^(t)_(n)) is replaced by a degree-of-contribution operation term R₁(w_(n)) of the following Equation (10) with the norm dimension p=1.

[Expression 8]

$$R_1(w_n) = \frac{\lambda}{N} \sum_{n=1}^{N} \left( \mu \sum_{k \in T} \left| w_{k,n} \right| + \nu \sum_{h \in U} \left| w_{h,n} \right| + \tau \sum_{l \in V} \left| w_{l,n} \right| \right) \quad (10)$$

Equation (10) is an example of the degree-of-contribution operation term R₁(w_(n)) of the L1 norm. τ is a third regularization coefficient related to the weight parameters w_(l,n) of the trivial feature variables x_(l,n) in which not contributing to prediction is trivial. τ is also a hyperparameter. l indicates a number representing the trivial feature variables in which not contributing to prediction is trivial, and V indicates the set of those feature variables. The degree-of-contribution operation term R₁(w_(n)) of Equation (10) is added to the error function E(w^(t)_(n)) by the learning section 412 as depicted in Equation (5). The learning section 412 thereby calculates the loss function L(w^(t)_(n)) and updates the weight parameters w_(k,n), w_(h,n), and w_(l,n).

By doing so, it is possible to produce the effects of preventing growth of the weight parameters w_(k,n) of the trivial feature variables x_(k,n) and the weight parameters w_(l,n) of the trivial feature variables x_(l,n), in which not contributing to prediction is trivial, and of obtaining a sparse model.
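A sketch of the three-group term of Equation (10) for a single weight vector, with the per-sample sum and 1/N factor omitted; the index-set arguments and the defaults (which satisfy μ+ν+τ=1.0 with μ and τ greater than ν, as operated below) are illustrative assumptions.

```python
import numpy as np

def r1_three_groups(w: np.ndarray, T_idx, V_idx, mu: float = 0.45,
                    nu: float = 0.1, tau: float = 0.45,
                    lam: float = 0.1) -> float:
    """Equation (10) for one weight vector. T = trivially contributing,
    V = trivially non-contributing, U = the remaining (non-trivial) features."""
    in_T = np.zeros(w.shape, dtype=bool)
    in_T[list(T_idx)] = True
    in_V = np.zeros(w.shape, dtype=bool)
    in_V[list(V_idx)] = True
    in_U = ~(in_T | in_V)  # everything else is non-trivial
    a = np.abs(w)
    return float(lam * (mu * a[in_T].sum() + nu * a[in_U].sum()
                        + tau * a[in_V].sum()))
```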

Moreover, the degree-of-contribution operation term R_(P)(w^(t)_(n)) of Equation (6) may be replaced by a degree-of-contribution operation term R₂(w_(n)) of the following Equation (11) with the norm dimension p=2.

[Expression 9]

$$R_2(w_n) = \frac{\lambda}{N} \sum_{n=1}^{N} \left( \mu \sum_{k \in T} w_{k,n}^2 + \nu \sum_{h \in U} w_{h,n}^2 + \tau \sum_{l \in V} w_{l,n}^2 \right) \quad (11)$$

Adding the degree-of-contribution operation term R₂(w_(n)) to the error function E(w_(n)) by the learning section 412 makes it possible to produce the effects of preventing growth of the weight parameters w_(k,n) of the trivial feature variables x_(k,n) and the weight parameters w_(l,n) of the trivial feature variables in which not contributing to prediction is trivial, and of suppressing overfitting to obtain a smooth prediction model.

Furthermore, the degree-of-contribution operation term R_(P)(w^(t)_(n)) of Equation (6) may be replaced by a degree-of-contribution operation term R_(els)(w_(n)) of the following Equation (12).

[Expression 10]

$$R_{els}(w_n) = \frac{\lambda}{N} \sum_{n=1}^{N} \left( \mu \sum_{k \in T} \left\{ \alpha \left| w_{k,n} \right| + (1-\alpha) w_{k,n}^2 \right\} + \nu \sum_{h \in U} \left\{ \beta \left| w_{h,n} \right| + (1-\beta) w_{h,n}^2 \right\} + \tau \sum_{l \in V} \left\{ \gamma \left| w_{l,n} \right| + (1-\gamma) w_{l,n}^2 \right\} \right) \quad (12)$$

Equation (12) is an equation of an elastic net of linear combination between the L1 norm and the L2 norm of each weight vector w_(n), and is a degree-of-contribution operation term obtained by the linear combination between Equations (10) and (11). In Equation (12), α, β, and γ (each from 0.0 to 1.0) are elastic coefficients. The elastic coefficients are also hyperparameters.

Adding the degree-of-contribution operation term R_(els)(w_(n)) to the error function E(w_(n)) makes it possible to produce an effect of preventing the growth of the weight parameters w_(k,n) of the trivial feature variables x_(k,n) and the weight parameters w_(l,n) of the trivial feature variables x_(l,n), in which not contributing to prediction is trivial, to obtain a sparse model as depicted in Equation (10), and an effect of suppressing overfitting to obtain a smooth model as depicted in Equation (11).

Furthermore, the operation section 413 operates the first regularization coefficient μ, the second regularization coefficient ν, and the third regularization coefficient τ in such a manner that a sum of the first regularization coefficient μ, the second regularization coefficient ν, and the third regularization coefficient τ is, for example, 1.0. The operation section 413 operates the first regularization coefficient μ, the second regularization coefficient ν, and the third regularization coefficient τ in such a manner that the first regularization coefficient μ and the third regularization coefficient τ are greater than the second regularization coefficient ν. The operation section 413 may operate the first regularization coefficient μ, the second regularization coefficient ν, and the third regularization coefficient τ on condition that one of the first and third regularization coefficients μ and τ is greater than 0.5.

By doing so, as the weight parameters w_(k,n) of the trivial feature variables x_(k,n) and the weight parameters w_(l,n) of the trivial feature variables x_(l,n), in which not contributing to prediction is trivial, grow, the regularization terms of the first regularization coefficient μ and the third regularization coefficient τ grow. It is, therefore, possible to suppress the weight parameters w_(k,n) of the trivial feature variables x_(k,n) and the weight parameters w_(l,n) of the trivial feature variables x_(l,n), in which not contributing to prediction is trivial, and to grow the values of the weight parameters w_(h,n) of the non-trivial feature variables x_(h,n), compared with the case of not using the degree-of-contribution operation term R_(P)(w^(t)_(n)). Furthermore, a range of the value of one of the first regularization coefficient μ and the third regularization coefficient τ may be set, for example, to be equal to or greater than 0.7.

Moreover, the selection section 411 may comprehensively change the feature variables designated as trivial feature variables and select the trivial feature variables on the basis of a result of carrying out the first embodiment. Specifically, the selection section 411 selects, for example, only one trivial feature variable, carries out the first embodiment, and obtains the prediction accuracy (the AUC, a determination coefficient r², or the like) and the degrees of importance.

Subsequently, the data analysis apparatus 300 changes the single feature variable to be selected and carries out the first embodiment as many times as the number of all feature variables. Furthermore, the data analysis apparatus 300 increases the number of designated feature variables to two, similarly carries out the first embodiment for all combinations of feature variables, further increases the number of designated feature variables, and carries out the first embodiment in all patterns in which feature variables can be selected as the trivial feature variables. Subsequently, in a case in which the prediction accuracy is equal to or higher than a threshold, the selection section 411 lists up the feature variables selected as the trivial feature variables and the combinations of the feature variables, and selects the trivial feature variables from among the listed feature variables and combinations.

The listed feature variables can be interpreted as important feature variables for realizing accurate prediction. At this time, the data analysis apparatus 300 may select the feature variables in descending order of frequency of appearance in the listed feature variables and combinations of feature variables. The selection section 411 can thereby dynamically select the trivial feature variables and the non-trivial feature variables.
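The exhaustive search described above might look like the following sketch; run_first_embodiment is a hypothetical callback that trains and evaluates a model with the given feature variables designated as trivial, and the threshold value is an assumption for illustration.

```python
from itertools import combinations

def search_trivial_candidates(feature_names, run_first_embodiment,
                              threshold=0.8):
    """Carry out the first embodiment for every possible designation of
    trivial feature variables and list the designations whose prediction
    accuracy stays at or above the threshold. `run_first_embodiment` is a
    hypothetical callback returning an accuracy such as the AUC or r^2.
    Note: the number of patterns grows as 2^D, so this search is exhaustive."""
    listed = []
    for r in range(1, len(feature_names) + 1):
        for subset in combinations(feature_names, r):
            accuracy = run_first_embodiment(trivial=subset)
            if accuracy >= threshold:
                listed.append((subset, accuracy))
    # Callers may then rank individual features by frequency of appearance
    # across the listed designations, as described above.
    return listed
```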

Moreover, as a result of the data analysis apparatus 300 carrying out the first embodiment in all patterns in which feature variables can be selected as trivial feature variables, the selection section 411 may refer to the obtained degrees of importance and select, as trivial feature variables, the feature variables that retain high degrees of importance even though they are designated as trivial feature variables. It can be interpreted that feature variables having high degrees of importance despite suppression of their degrees of contribution to prediction are important feature variables for realizing accurate prediction. At this time, the data analysis apparatus 300 may select the feature variables in descending order of frequency of appearance among the feature variables and combinations listed as having degrees of importance equal to or higher than a threshold even though designated as trivial feature variables. The selection section 411 can thereby dynamically select the trivial feature variables and the non-trivial feature variables.

In this way, according to the second embodiment, using the loss function L(w^(t)_(n)) that sets different penalties among the trivial feature variables, the non-trivial feature variables, and the trivial feature variables in which not contributing to prediction is trivial, in machine learning accountable for the grounds of prediction, makes it possible to realize prediction while suppressing the degrees of contribution of the trivial feature variables to prediction and those of the trivial feature variables in which not contributing to prediction is trivial, and making active use of the non-trivial feature variables. This makes it possible to extract unknown feature variables that contribute to prediction but may not have been discovered yet in academic findings or the like.

Third Embodiment

A third embodiment will be described. The third embodiment is an example related to a method of selecting trivial feature variables and non-trivial feature variables by the selection section 411. It is noted that the same constituent elements as those in the first and second embodiments are denoted by the same reference characters and are not often described.

In the first and second embodiments, in selecting the trivial feature variables, the selection section 411 designates the trivial feature variables from among the feature variables already indicated to be academically important in documents or the like or by the accumulated findings of developers and engineers. In the third embodiment, the selection section 411 selects trivial feature variables on the basis of their actual degrees of contribution to prediction. To describe the selection method based on the degree of contribution to prediction, taking the prediction of Boston house prices as an example, a performance verification is carried out on the basis of the data used in Harrison, D. and Rubinfeld, D. L. (1978), "Hedonic prices and the demand for clean air," J. Environ. Economics and Management 5, 81-102.

FIG. 8 is an explanatory diagram depicting a feature variable vector Features and ground truth data Target. In the experiment, prediction is applied in a case of using 10-fold cross validation and all thirteen feature variables in (1) to (13), two feature variables having degrees of importance accounting for the top 20% are selected as the trivial feature variables from among the feature variables contributing to prediction, and the first embodiment is carried out.

FIG. 9 is an explanatory diagram depicting an experimental result. A prediction result by the data analysis apparatus 300 in a case in which the operation section 413 does not operate the hyperparameters is the Normal graph, while a prediction result in a case in which the operation section 413 operates the hyperparameters is the Suppression graph. Since the determination coefficient r² (=0.75) in the Normal graph exceeds 0.7, the data analysis apparatus 300 calculates the degrees of importance with respect to the degrees of contribution to accurate prediction.

The selection section 411 compares the magnitudes of the weight vector w_n among the feature variables, and selects the feature variables (6) and (13), which are the top two feature variables, as trivial feature variables. The operation section 413 operates the first regularization coefficient μ related to the weight parameters w_(k,n) of the trivial feature variables x_(k,n) to be equal to or greater than 0.5 using Equation (7). The learning section 412 generates learning parameters (weight vector w_n) in the learning process. The selection section 411 then compares the magnitudes of the weight vector w_n among the feature variables again.
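The flow just described, namely learning once, picking the top-weighted features, raising the first regularization coefficient, and relearning, might be sketched as follows; `fit_normal`, `fit_with_penalties`, and `names` are hypothetical stand-ins for the learning section 412 and the thirteen feature names, not the patented implementation.

```python
import numpy as np

def select_trivial_by_contribution(w, feature_names, top_ratio=0.2):
    """Select the features whose weight magnitudes |w_n| fall in the
    top `top_ratio` as trivial feature variables."""
    k = max(1, int(top_ratio * len(feature_names)))   # 13 features, 20% -> 2
    order = np.argsort(-np.abs(w))                    # descending magnitude
    return [feature_names[i] for i in order[:k]]

# Hypothetical flow around the learning and operation sections:
# w_normal = fit_normal(X, t)                               # Normal prediction
# trivial = select_trivial_by_contribution(w_normal, names) # e.g. (6) and (13)
# mu = 0.6                                                  # first coefficient >= 0.5
# w_sup = fit_with_penalties(X, t, trivial, mu=mu, nu=1.0 - mu)  # Suppression
```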

Since the determination coefficient r² (=0.82) exceeds 0.7, it is understood that prediction is carried out with high accuracy even after operating the first regularization coefficient μ. A comparison of the magnitude of the weight vector w_n in the Normal prediction with that in the Suppression prediction indicates that the weight vector w_n of the feature variables (6) and (13) can be suppressed and that weight-vector components small in the Normal prediction can be grown.

While the feature variables having degrees of importance in the top 20% among the feature variables contributing to prediction are selected as the trivial feature variables in the third embodiment, the percentage may be 50% or the like, or the number of trivial feature variables may be determined in advance. Furthermore, while the selection method based on the degrees of contribution to prediction has been described in the third embodiment, the selection section 411 may select the trivial feature variables on the basis of a prediction result. For example, the selection section 411 may keep selecting trivial feature variables until the prediction result indicates that the determination coefficient r² or the AUC is equal to or smaller than 0.8.

In this way, according to the third embodiment, using the loss function that sets different penalties between the trivial feature variables and the non-trivial feature variables in machine learning accountable for the grounds of prediction makes it possible to realize prediction while suppressing the degrees of contribution (weight parameters w_(k,n)) of the trivial feature variables x_(k,n) to prediction and making active use of the other, non-trivial feature variables x_(h,n). This makes it possible to extract unknown feature variables that contribute to prediction but may not have been discovered yet in academic findings or the like.

Fourth Embodiment

A fourth embodiment will be described. The fourth embodiment is an example related to a method of determining the first regularization coefficient μ of the trivial feature variables and the second regularization coefficient ν of the non-trivial feature variables by the operation section 413. It is noted that the same constituent elements as those in the first embodiment are denoted by the same reference characters and are not often described.

In the first embodiment, the operation section 413 determines the regularization term of the trivial feature variables and that of the non-trivial feature variables on condition that the range of the value of each of the first and second regularization coefficients μ and ν is set in such a manner that the sum of the first regularization coefficient μ of the trivial feature variables and the second regularization coefficient ν of the non-trivial feature variables is equal to 1 and that the first regularization coefficient μ of the trivial feature variables is greater than 0.5. In the fourth embodiment, an example in which the learning section 412 generates the learning parameters having the highest prediction accuracy within a designated range of values under the above condition will be described.
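A minimal sketch of such a search is given below, with `fit` and `score` as hypothetical stand-ins for the learning section 412 and the prediction-accuracy computation (for example, r² or AUC on validation folds); the grid of candidate values is an assumption.

```python
import numpy as np

def search_regularization_balance(fit, score, X, t, grid=None):
    """Scan mu in (0.5, 1), keep nu = 1 - mu, and retain the learning
    parameters that attain the highest prediction accuracy."""
    if grid is None:
        grid = np.linspace(0.55, 0.95, 9)    # candidate first coefficients
    best = (-np.inf, None, None)
    for mu in grid:
        nu = 1.0 - mu                        # keep mu + nu = 1, mu > 0.5
        w = fit(X, t, mu=mu, nu=nu)
        acc = score(w, X, t)
        if acc > best[0]:
            best = (acc, mu, w)
    return best                              # (accuracy, chosen mu, parameters)
```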

FIG. 10 depicts an example of screen display of the data analysis apparatus 300 according to the fourth embodiment. As depicted in FIG. 10, a slider 1001, which is an example of a user interface, may be used to adjust the values of the first regularization coefficient μ of the trivial feature variables and the second regularization coefficient ν of the non-trivial feature variables in determining these coefficients. Furthermore, the values of the first regularization coefficient μ and the second regularization coefficient ν may subsequently be changed again after confirming the magnitude of the weight vector w_n as depicted in FIG. 7.

Moreover, as a method of determining the values, the user may set the first regularization coefficient μ of the trivial feature variables to a fixed value such as 0.9, or values in a desired balance may be selected on the basis of the degree of suppression of the weight vector w_n and the prediction accuracy.

In this way, according to the fourth embodiment, using the loss function that sets different penalties between the trivial feature variables and the non-trivial feature variables in machine learning accountable for the grounds of prediction makes it possible to realize prediction while suppressing the degrees of contribution (weight parameters w_(k,n)) of the trivial feature variables x_(k,n) to prediction and making active use of the other, non-trivial feature variables x_(h,n). This makes it possible to extract unknown feature variables that contribute to prediction but may not have been discovered yet in academic findings or the like.

Fifth Embodiment

In a fifth embodiment, an example of calculating the degrees of importance used in the first to fourth embodiments will be described. It is noted that the same constituent elements as those in the first to fourth embodiments are denoted by the same reference characters and are not often described.

<Example of Reallocation of Feature Vectors>

While artificial intelligence (AI) has a capability of solving linearly inseparable problems, it is often unclear why the AI made a given judgment. A machine learning approach such as deep learning, in particular, is high in prediction accuracy but low in accountability. For example, in a case in which the AI outputs a diagnosis result of "prone to catch a cold" for a certain patient, a doctor is unable to answer the question of why the AI obtained such a result. If the AI can trace a judgment back to the cause of a symptom, the doctor can give proper treatment to the patient.

FIGS. 11A and 11B are explanatory diagrams depicting an example of reallocation of feature variable vectors. In (A), a plurality of feature variable vectors x_n (n=1, 2, . . . , N, where N is the number of images) are present in a feature variable space SP1. The plurality of feature variable vectors x_n are discriminated into correct labels La and Lb by, for example, a nonlinear prediction model PM1. In (B), a plurality of feature variable vectors x_n are present in a feature variable space SP2 and are discriminated into correct labels La and Lb by, for example, a linear prediction model PM2.

In (A), a machine learning approach such as deep learning learns a linear regression anew to explain the prediction model PM1 that is the discrimination result. Specifically, the approach executes, for example, a retrofitted process of obtaining the prediction model PM1 and then locally performing straight-line approximation on the prediction model PM1. However, it is unclear in such a retrofitted process whether a straight-line-approximated local part of the prediction model PM1 can correctly explain the feature variable vectors x_n. Furthermore, and more importantly, executing the regression for straight-line approximation makes it necessary to execute machine learning twice after all.

Since the prediction model PM2 in (B) is linear, referring to an inclination of the prediction model PM2 makes it possible to grasp with which parameter in the feature variable space SP2 each feature variable vector x_n is weighted, and thus to correctly explain the feature variable vectors x_n. In the fifth embodiment, the plurality of feature variable vectors x_n in the feature variable space SP1 are reallocated to the other feature variable space SP2 without obtaining the nonlinear prediction model PM1 of (A). The linear prediction model PM2 is thereby obtained; thus, it is possible to grasp with which parameter in the feature variable space SP2 each feature variable vector x_n is weighted and to correctly explain each feature variable vector x_n in accordance with its degree of importance.

In other words, the user can grasp which factor (feature) included in the feature variable vectors x_n contributes to the prediction result for every sample (for example, for every patient); thus, it is easy to explain why such a prediction result is obtained. Therefore, it is possible to improve the accountability of the machine learning. According to the above example, the user can grasp why the AI output the diagnosis result of "prone to catch a cold" for the certain patient (for example, because of slimness). Furthermore, it is possible to improve the efficiency of the machine learning since it is unnecessary to execute the machine learning twice as in (A). Therefore, it is possible to promptly provide an explanation as described above.

<Example of Structure of Neural Network>

FIG. 12 is an explanatory diagram depicting an example of a structure of a neural network according to the fifth embodiment. A neural network 1200 has a data unit group DU, a reporting unit group RU, a harmonizing unit group HU, an attention unit AU, a reallocation unit RAU, a unifying unit UU, a decision unit DCU, and an importance unit IU.

The data unit group DU is configured such that a plurality of data units DUl (l is a layer number and 1≤l≤L; L is the layer number of the lowest layer, and L=4 in FIG. 12) are connected in series. The data unit DU1 on the highest layer of l=1 corresponds to an input layer 1201, and the data units DUl of l≥2 correspond to intermediate layers (also referred to as hidden layers) of the neural network 1200. Each data unit DUl is a perceptron to which output data from the data unit DU(l−1) of the previous layer is input, in which calculation is performed using a learning parameter of the own data unit DUl, and from which output data is output.

It is noted, however, that the data unit DU1 holds the training data at the time of learning by the learning section 412. The training data herein means, for example, sample data configured with combinations {x_n, t_n} of images x_n, as an example of the feature variable vector x_n, and correct labels t_n that are the true values for the images (n=1, 2, . . . , N, where N is the number of images). The images x_n are data having a two-dimensional matrix structure and are hereinafter handled as d-dimensional vectors (where d is an integer satisfying d≥1) obtained by raster scanning. Where "x" is written for simplicity of description, it is assumed to be the one-dimensional vector obtained by raster scanning the image x_n in matrix form.

The correct label t_n is a K-dimensional vector that indicates a type (for example, an animal such as a dog or a cat) in a one-hot representation with respect to the number of types K of the images x_n. In the one-hot representation, each element of the vector corresponds to a type of the images x_n; 1.0 is stored in only one element, while 0.0 is stored in all the other elements. The type (for example, a dog) corresponding to the element of 1.0 is the ground-truth type. It is noted that in a case in which medical images x_n such as CT images, MRI images, or ultrasound images are the input, the label t_n is a true value that represents a type of disease or a prognosis (good or bad) of a patient.
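For illustration, a one-hot correct label t_n as described above can be built as follows; the type indices and the value of K are assumptions for the example.

```python
import numpy as np

def one_hot(type_index, K):
    """One-hot representation of a correct label t_n: 1.0 in the single
    element for the ground-truth type, 0.0 in all the other elements."""
    t = np.zeros(K)
    t[type_index] = 1.0
    return t

t_n = one_hot(1, K=3)   # e.g. types (dog, cat, bird) with "cat" as ground truth
```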

It is assumed that the images x_n∈R^(d) (where R^(d) denotes the d-dimensional real numbers) are feature variable vectors configured with the d-dimensional real numbers R^(d). A function h_D^(l+1) that indicates the data unit DU(l+1) is expressed by the following Equation (13).

[Expression 11]

Equation (13)

h_D^(l+1) = f_D^(l)(W_D^(l) h_D^(l))  (13)

where h_D^(l) ∈ R^(d_l) is the input/output vector of the data unit DUl, W_D^(l) ∈ R^(d_(l+1)×d_l) is a learning parameter, and h_D^(1) = x_n when l=1.

In Equation (13), the index l (an integer satisfying 1≤l≤L) denotes the layer number (the same applies to the following equations). L is an integer equal to or greater than 1 and denotes the layer number of the lowest layer. f_D^(l) on the right side is an activation function. As the activation function, any of various activation functions such as the sigmoid function, the hyperbolic tangent function (tanh function), and the rectified linear unit (ReLU) function may be used. The matrix W_D^(l) is the learning parameter of the data unit DUl. The vector h_D^(l) on the right side is the input vector to the data unit DUl, that is, the output vector from the data unit DU(l−1) on the previous layer. It is noted that the output vector h_D^(l) in the case of the layer number l=1 is h_D^(1)=x_n.
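As a minimal sketch of Equation (13), one data-unit step in numpy might look like this; the shapes and the choice of ReLU as f_D^(l) are assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def data_unit(h_prev, W_D, f=relu):
    """Equation (13): h_D^(l+1) = f_D^(l)(W_D^(l) h_D^(l)).
    W_D has shape (d_(l+1), d_l); for l=1 the input is h_D^(1) = x_n."""
    return f(W_D @ h_prev)

# Hypothetical chaining over layers, with x_n the raster-scanned image:
# h2 = data_unit(x_n, W_D1); h3 = data_unit(h2, W_D2); ...
```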

It is noted that the data unit DU1 holds the images x_n, which are the feature variable vectors serving as the test data, at the time of prediction by the prediction section 414.

The output vector h_D^(l) from the data unit DUl on the same layer is input to each reporting unit RUl (2≤l≤L), and the reporting unit RUl contracts the number of dimensions of the output vector h_D^(l). A function h_R^(l) that indicates the reporting unit RUl is expressed by the following Equation (14).

[Expression 12]

Equation (14)

h_R^(l) = σ(W_R^(l) h_D^(l))  (14)

In Equation (14), the matrix W_R^(l) is a learning parameter of the reporting unit RUl. The d-dimensional output vector h_D^(l) from each data unit DUl is contracted to an m-dimensional output vector h_R^(l) by Equation (14). Further, σ is the sigmoid function.

Each harmonizing unit HUl (2≤l≤L) is provided between the corresponding data unit DUl on an intermediate layer and the reallocation unit RAU. The harmonizing units HUl each convert the number of dimensions of the output data from the data units DUl on the intermediate layers into the same size. Therefore, pieces of the output data made to have the same number of dimensions by the harmonizing units HUl are input to the reallocation unit RAU.

In other words, the output vector h_D^(l) from the data unit DUl on the same layer is input to each harmonizing unit HUl, and the harmonizing units HUl each convert the number of dimensions of the output vector h_D^(l) into the same number of dimensions. A function h_H^(l) that indicates each harmonizing unit HUl is expressed by the following Equation (15).

[Expression 13]

Equation (15)

h_H^(l) = f_H(W_H^(l) h_D^(l))  (15)

where W_H^(l) ∈ R^(m×d_l) is a learning parameter.

In Equation (15), the matrix W_H^(l) is a learning parameter of the harmonizing unit HUl. The d-dimensional output vector h_D^(l) from the data unit DUl is thereby converted into an m-dimensional output vector h_H^(l). It is noted that m is a hyperparameter that determines the number of dimensions and may differ from the m used in the reporting units RUl. Furthermore, f_H is an activation function.

The attention unit AU calculates a weight α for each data unit DUl using the output vectors h_R^(l) from the reporting units RUl. A function α that indicates the attention unit AU is expressed by the following Equation (16).

[Expression 14]

Equation (16)

α = softmax(W_A h_R)  (16)

where W_A ∈ R^((L−1)×M) (M = m(L−1)) is a learning parameter.

In Equation (16), the matrix W_A is a learning parameter of the attention unit AU. The softmax function, which is one type of activation function, converts the stacked vector h_R into the weight vector α having one element per reported layer (L=4 in the example of Equation (17) below). As indicated by the following Equation (17), the vector h_R on the right side of Equation (16) is a vector obtained by stacking the vectors h_R^(l) in the perpendicular direction.

[Expression 15]

Equation (17)

h_R = [h_R^(2); . . . ; h_R^(L)]  (17)

For example, when L=4 with h_R^(2)=[0,1,0], h_R^(3)=[0,0,1], and h_R^(4)=[1,0,0], the stacked vector is h_R=[0,1,0,0,0,1,1,0,0]^t.

Therefore, the matrix W_A is a matrix of L−1 rows by M columns (where M is the number of elements of the vector h_R). By adopting the softmax function in the attention unit AU, each element of the vector α (the sum of all the elements is 1) represents the weight of the corresponding data unit DUl.
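Putting Equations (14), (16), and (17) together, a hedged numpy sketch of the reporting units and the attention unit might look as follows; `h_D_layers`, `W_R_list`, and the shapes are assumptions, not the patented implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_weights(h_D_layers, W_R_list, W_A):
    """Contract each layer output to an m-dimensional report (Eq. (14)),
    stack the reports vertically into h_R (Eq. (17)), and convert them
    into per-layer weights alpha with softmax (Eq. (16))."""
    reports = [sigmoid(W_R @ h_D)
               for W_R, h_D in zip(W_R_list, h_D_layers)]
    h_R = np.concatenate(reports)          # h_R = [h_R^(2); ...; h_R^(L)]
    return softmax(W_A @ h_R)              # alpha: one weight per layer, sums to 1
```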

The reallocation unit RAU reallocates the feature variable vectors (images x_n) in one feature variable space to the other feature variable space. Specifically, as depicted in FIGS. 11A and 11B, the prediction model obtained from a feature variable vector group in the feature variable space SP1 can be nonlinear; thus, the reallocation unit RAU transfers the feature variable vector group to the feature variable space SP2 so that a linear prediction model can be obtained in the feature variable space SP2. A function h_T^(l) that indicates the reallocation unit RAU is expressed by the following Equation (18).

[Expression 16]

Equation (18)

h_T^(l) = f_T(h_H^(l), x_n)  (18)

As the function f_T, the Hadamard product between vectors, elementwise addition, or the like can be used. In the present embodiment, the Hadamard product is used (refer to the following Equation (19)). In Equation (19), the Hadamard product between the output vector h_H^(l) from the harmonizing unit HUl and the feature variable vector x_n is obtained.

[Expression 17]

Equation (19)

h_T^(l) = h_H^(l) ⊙ x_n  (19)
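A minimal sketch of Equations (15), (18), and (19) follows; choosing tanh for f_H is an assumption, and the Hadamard product in `reallocate` assumes m is chosen equal to d so that the elementwise product with x_n is defined.

```python
import numpy as np

def harmonize(h_D, W_H, f=np.tanh):
    """Equation (15): convert a layer output to the common m dimensions."""
    return f(W_H @ h_D)

def reallocate(h_H, x_n):
    """Equations (18) and (19) with f_T chosen as the Hadamard product:
    h_T^(l) = h_H^(l) (*) x_n, moving x_n toward the linear space SP2."""
    return h_H * x_n
```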

The unifying unit UU unifies the output vectors h_T^(l) from the reallocation unit RAU with the output vector α from the attention unit AU. In other words, the unifying unit UU weights the output vectors h_T^(l) from the reallocation unit RAU with the output vector α from the attention unit AU. A function h_U that indicates the unifying unit UU is expressed by the following Equation (20).

[Expression 18]

Equation (20)

h_U = Σ_(k=1)^(L−1) α[k] h_T^(k+1)  (20)

In Equation (20), α[k] on the right side indicates the element (weight) in the k-th dimension of the output vector α of Equation (16).

The decision unit DCU decides on the predicted value y_n and outputs the predicted value y_n to an output layer 1203. Specifically, the decision unit DCU weights, for example, the output vector h_U from the unifying unit UU with a weight vector w_o that is one of the learning parameters, and gives the resultant value to the sigmoid function σ, thereby obtaining the predicted value y_n. A function y_n that indicates the decision unit DCU is expressed by the following Equation (21). In Equation (21), the t in w_o^t denotes transposition.

[Expression 19]

Equation (21)

y_n = σ(w_o^t h_U)  (21)

The importance unit IU calculates a degree-of-importance vector s_n^(l) that indicates the degree of importance of the feature variables on each layer of the neural network, and outputs the degree-of-importance vector s_n^(l) to the output layer 1203. A function s_n^(l) that indicates the importance unit IU is expressed by the following Equation (22).

[Expression 20]

Equation (22)

s_n^(l) = α[l] f_T(w_o, h_H^(l))  (22)

In Equation (22), α[l] on the right side indicates the element (weight) corresponding to the l-th layer of the output vector α of Equation (16). As the function f_T, the Hadamard product between vectors, elementwise addition, or the like can be used, similarly to Equation (18). In the present embodiment, the Hadamard product is used; the degree-of-importance vector s_n^(l) is then the Hadamard product between the weight vector w_o and the output vector h_H^(l) from the harmonizing unit HUl. The degree-of-importance vector s_n^(l) is the degree of importance of the n-th feature variable vector (image) x_n on the layer l.
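Equations (20) to (22) can be sketched together as follows; the per-layer lists and shapes are assumptions made for illustration, not the patented implementation itself.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unify(alpha, h_T_layers):
    """Equation (20): h_U = sum over k of alpha[k] * h_T^(k+1)."""
    return sum(a * h_T for a, h_T in zip(alpha, h_T_layers))

def decide(w_o, h_U):
    """Equation (21): y_n = sigma(w_o^t h_U)."""
    return sigmoid(w_o @ h_U)

def importance(alpha, w_o, h_H_layers):
    """Equation (22) with f_T as the Hadamard product:
    s_n^(l) = alpha[l] * (w_o (*) h_H^(l)), one vector per layer."""
    return [a * (w_o * h_H) for a, h_H in zip(alpha, h_H_layers)]
```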

<Example of Functional Configuration of Data Analysis Apparatus 300>

FIG. 13 is a block diagram depicting an example of a functional configuration of the data analysis apparatus 300 according to the fifth embodiment. The data analysis apparatus 300 has the input layer 1201, the intermediate layers 1202, the output layer 1203, a conversion section 1301, a reallocation section 1302, a predicted data calculation section 1303, a degree-of-importance calculation section 1304, a setting section 1305, a unifying section 1306, and a contraction section 1307. These are an example of internal configurations of the learning section 412 and the prediction section 414.

As indicated by Equation (15), the conversion section 1301 converts the number of dimensions of the output vector h_D^(l) from the data unit DUl (l≥2) on each intermediate layer into the common number of dimensions m on the basis of the output vector h_D^(l) and the matrix W_H^(l), and outputs the output vector h_H^(l) after conversion. The conversion section 1301 is the harmonizing unit group HU described above.

As indicated by Equations (18) and (19), the reallocation section 1302 reallocates each feature variable vector x_n in the feature variable space SP1 given to the input layer 1201 to the feature variable space SP2 on the basis of the output vector h_H^(l) after conversion from the conversion section 1301 and the feature variable vector x_n. The reallocation section 1302 is the reallocation unit RAU described above.

As indicated by Equation (21), the predicted data calculation section 1303 calculates the predicted value y_n with respect to each feature variable vector x_n in the feature variable space SP1 on the basis of the reallocation result h_T^(l) by the reallocation section 1302 and the weight vector w_o. The predicted data calculation section 1303 is the decision unit DCU described above.

As indicated by Equation (22), the degree-of-importance calculation section 1304 calculates the degree-of-importance vector s_n^(l) of the feature variable vector x_n on each layer l of the intermediate layers 1202 on the basis of the output vector h_H^(l) after conversion and the weight vector w_o. The degree-of-importance calculation section 1304 is the importance unit IU described above.

For example, as for images x_n that express an animal, it is assumed that the output vector h^(la)_D on one layer la is a feature variable that indicates whether the contour of a face is suitable for a cat, and that the output vector h^(lb)_D on another layer lb (≠la) is a feature variable that indicates whether the shape of an ear is suitable for a cat. In this case, referring to the corresponding degree-of-importance vectors s^(la)_n and s^(lb)_n enables a user to explain in light of which feature of the face in the images x_n the data analysis apparatus 300 discriminates the animal as a cat. For example, in a case in which the degree-of-importance vector s^(la)_n is low but the degree-of-importance vector s^(lb)_n is high, the user can explain that the data analysis apparatus 300 discriminates the animal as a cat in light of the shape of the ear in the images x_n. It is noted that the calculated degree-of-importance vectors s_n^(l) are extracted by the degree-of-importance calculation section 415.

As indicated by Equations (16) and (17), the setting section 1305 sets the weight α of each intermediate layer 1202 on the basis of the output vector from the intermediate layer 1202 and the matrix W_A. The setting section 1305 is the attention unit AU described above.

As indicated by Equation (20), the unifying section 1306 unifies the reallocation results h_T^(l) with the weight α set by the setting section 1305. The unifying section 1306 is the unifying unit UU described above. In this case, the predicted data calculation section 1303 calculates the predicted value y_n on the basis of the unifying result h_U by the unifying section 1306 and the weight vector w_o. Furthermore, the degree-of-importance calculation section 1304 calculates the degree-of-importance vector s_n^(l) on the basis of the weight α set by the setting section 1305, the output vector h_H^(l) after conversion, and the weight vector w_o.

As indicated by Equation (14), the contraction section 1307 contracts the number of dimensions d of the output vector h_D^(l) from each intermediate layer 1202 on the basis of the output vector h_D^(l) from the intermediate layer 1202 and the matrix W_R^(l), and outputs the output vector h_R^(l) after contraction. The contraction section 1307 is the reporting unit group RU described above. In this case, the setting section 1305 sets the weight α of each intermediate layer 1202 on the basis of the output vector h_R^(l) after contraction from the contraction section 1307 and the matrix W_A.

In a case in which training data that contains each feature variable vector x_n in the feature variable space SP1 and the correct label t_n with respect to the predicted value y_n is given, the learning section 412 optimizes the matrix W_D^(l) that is a first learning parameter, the matrix W_H^(l) that is a second learning parameter, the weight vector w_o that is a third learning parameter, the matrix W_A that is a fourth learning parameter, and the matrix W_R^(l) that is a fifth learning parameter using the predicted value y_n and the correct label t_n in such a manner, for example, that the cross entropy between the correct label t_n and the predicted value y_n becomes minimum.
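As a hedged illustration of this learning step, the sketch below treats the predicted value as a probability vector for the cross entropy and leaves the gradient computation (backpropagation over the five learning parameters) as an assumed input `grads`.

```python
import numpy as np

def cross_entropy(y, t, eps=1e-12):
    """Cross entropy between the one-hot correct label t_n and the
    predicted value y_n, treated here as a probability vector."""
    return -np.sum(t * np.log(np.clip(y, eps, 1.0)))

def sgd_step(params, grads, lr=1e-3):
    """One plain gradient-descent update over the five learning
    parameters (W_D^(l), W_H^(l), w_o, W_A, W_R^(l)); the gradients
    are assumed to come from backpropagation, which is omitted."""
    return [p - lr * g for p, g in zip(params, grads)]
```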

The prediction section 414 sets the optimized learning parameters in the neural network 1200 and gives each feature variable vector x′_n as the test data to the input layer 1201, thereby causing the predicted data calculation section 1303 to calculate a predicted value y′_n.

In this way, according to the fifth embodiment, reallocating each feature variable vector x_n as the sample data in advance makes it possible to calculate the degree of importance of each feature variable even if the neural network is multi-layered, and to realize accountability per sample (feature variable vector x_n) with high accuracy and high efficiency. Moreover, since the linear prediction model is obtained by reallocating each sample (feature variable vector x_n) in advance, it is possible to calculate the predicted value with high accuracy and with low load at the times of learning and prediction.

As described so far, according to the fifth embodiment, the data analysis apparatus 300 has the conversion section 1301, the reallocation section 1302, and the degree-of-importance calculation section 1304. Therefore, the linear prediction model is obtained by reallocating the feature variable vectors (x_n, x′_n) in advance; thus, it is possible to calculate the predicted value with high accuracy and with low load at the times of learning and prediction. Furthermore, it is possible to grasp the features possessed by the feature variable vectors (x_n, x′_n) through the per-layer degrees of importance from the degree-of-importance calculation section 1304. It is thereby possible to realize accountability about the feature variable vectors (x_n, x′_n) given to the neural network as an object to be analyzed with high accuracy and high efficiency.

Moreover, the data analysis apparatus 300 has the predicted data calculation section 1303; thus, it is possible to realize accountability about the reason for obtaining the prediction result (y_n, y′_n) from the neural network as an object to be analyzed with respect to the feature variable vectors (x_n, x′_n) with high accuracy and high efficiency.

Furthermore, the data analysis apparatus 300 has the setting section 1305 and the unifying section 1306; thus, the predicted data calculation section 1303 can calculate the prediction result based on the reallocation result with high accuracy.

Moreover, the data analysis apparatus 300 has the contraction section 1307; thus, it is possible to improve the efficiency of data analysis by the contraction of dimensions.

Furthermore, the data analysis apparatus 300 can construct a highly accurate prediction model by learning the learning parameters.

Since a degree of importance is obtained per feature variable, the selection section 411 can select the non-trivial feature variables on the basis of the degrees of importance.

As described so far, the data analysis apparatus 300 described above can extract unknown feature variables that contribute to prediction but may not have been discovered yet in academic findings or the like by increasing the degrees of contribution of the non-trivial feature variables to prediction, reducing the degrees of contribution of the trivial feature variables, and minimizing the reduction in prediction accuracy in machine learning accountable for the grounds of prediction.

The present invention is not limited to the embodiments described above and encompasses various modifications and equivalent configurations within the meaning of the accompanying claims. For example, the embodiments have been described in detail to make the present invention easy to understand, and the present invention is not necessarily limited to embodiments having all the described configurations. Furthermore, a part of the configurations of one embodiment may be replaced by configurations of another embodiment. Moreover, the configurations of another embodiment may be added to the configurations of one embodiment. Further, for a part of the configurations of each embodiment, other configurations may be added, deleted, or substituted.

Moreover, a part or all of the configurations, functions, processing sections, processing means, and the like described above may be realized by hardware by being designed, for example, as an integrated circuit, or may be realized by software by causing a processor to interpret and execute programs that realize the functions.

Information in programs, tables, files, and the like for realizing the functions can be stored in a storage device such as a memory, a hard disk, or a solid state drive (SSD), or in a recording medium such as an integrated circuit (IC) card, a secure digital (SD) card, or a digital versatile disc (DVD).

Furthermore, only the control lines and information lines considered necessary for the description are illustrated; not all the control lines and information lines necessary for implementation are necessarily illustrated. In actuality, almost all the configurations may be considered to be mutually connected.

What is claimed is:
1. A data analysis apparatus (300) comprising: a processor (301) that executes a program; and a storage device (302) that stores the program, wherein the processor (301) executes a selection process for selecting a first feature variable group that is a trivial feature variable group contributing to prediction and a second feature variable group other than the first feature variable group from a set of feature variables, an operation process for operating a first regularization coefficient related to a first weight parameter group corresponding to the first feature variable group among a set of weight parameters configuring a prediction model in such a manner that the loss function is larger, and operating a second regularization coefficient related to a second weight parameter group corresponding to the second feature variable group among the set of weight parameters configuring the prediction model in such a manner that the loss function is smaller, in a loss function related to a difference between a prediction result output in a case of inputting the set of feature variables to the prediction model and ground truth data corresponding to the feature variables; and a learning process for learning the set of weight parameters of the prediction model in such a manner that the loss function is minimum as a result of operating the first regularization coefficient and the second regularization coefficient by the operation process.
2. The data analysis apparatus (300) according to claim 1, wherein the processor (301) executes a prediction process for calculating the prediction result and prediction accuracy of the prediction result based on the prediction result and the ground truth data, and in the operation process, the processor (301) re-operates the first regularization coefficient and the second regularization coefficient in a case in which the prediction accuracy is equal to or lower than predetermined prediction accuracy, and in the learning process, the processor (301) re-learns the set of weight parameters of the prediction model in such a manner that the loss function is minimum as a result of re-operating the first regularization coefficient and the second regularization coefficient by the operation process.
3. The data analysis apparatus (300) according to claim 1, wherein the processor (301) executes a degree-of-importance calculation process for calculating a degree of importance of each of the feature variables on a basis of the set of weight parameters, and in the selection process, the processor (301) selects the first feature variable group and the second feature variable group on a basis of the degree of importance calculated by the degree-of-importance calculation process.
4. The data analysis apparatus (300) according to claim 3, wherein in the degree-of-importance calculation process, the processor (301) calculates a degree of importance of each first feature variable in the first feature variable group and a degree of importance of each second feature variable in the second feature variable group in such a manner that the degree of importance of the first feature variable is lower than the degree of importance of the second feature variable.
5. The data analysis apparatus (300) according to claim 1, wherein in the operation process, the processor (301) operates a range of each of the first regularization coefficient and the second regularization coefficient to fall in a range of a sum of the first regularization coefficient and the second regularization coefficient.
6. The data analysis apparatus (300) according to claim 5, wherein in the operation process, the processor (301) operates the range of each of the first regularization coefficient and the second regularization coefficient in response to an operation input to a user interface displayed on the data analysis apparatus (300) or on another apparatus communicably connected to the data analysis apparatus (300).
7. The data analysis apparatus (300) according to claim 1, wherein in the selection process, the processor (301) selects the first feature variable group and a third feature variable group not contributing to prediction and selects feature variables other than the first feature variable group and the third feature variable group as the second feature variable group from among the set of feature variables, in the operation process, in the loss function, the processor (301) operates the first regularization coefficient related to the first weight parameter group corresponding to the first feature variable group among the set of weight parameters configuring the prediction model in such a manner that the loss function is larger, operates the second regularization coefficient related to the second weight parameter group corresponding to the second feature variable group in such a manner that the loss function is smaller, and operates a third regularization coefficient related to a third weight parameter group corresponding to the third feature variable group in such a manner that the loss function is larger, and in the learning process, the processor (301) learns the set of weight parameters of the prediction model in such a manner that the loss function is minimum as a result of operating the first regularization coefficient, the second regularization coefficient, and the third regularization coefficient by the operation process.
8. The data analysis apparatus (300) according to claim 7, wherein the processor (301) executes a prediction process for calculating the prediction result and prediction accuracy of the prediction result based on the prediction result and the ground truth data, and in the operation process, the processor (301) re-operates the first regularization coefficient, the second regularization coefficient, and the third regularization coefficient in a case in which the prediction accuracy is equal to or lower than predetermined prediction accuracy, and in the learning process, the processor (301) re-learns the set of weight parameters of the prediction model in such a manner that the loss function is minimum as a result of re-operating the first regularization coefficient, the second regularization coefficient, and the third regularization coefficient by the operation process.
9. The data analysis apparatus (300) according to claim 7, wherein the processor (301) executes a degree-of-importance calculation process for calculating a degree of importance of each of the feature variables on a basis of the set of weight parameters, and in the selection process, the processor (301) selects the first feature variable group, the second feature variable group, and the third feature variable group on a basis of the degree of importance calculated by the degree-of-importance calculation process.
10. The data analysis apparatus (300) according to claim 9, wherein in the degree-of-importance calculation process, the processor (301) calculates a degree of importance of each first feature variable in the first feature variable group, a degree of importance of each second feature variable in the second feature variable group, and a degree of importance of each third feature variable in the third feature variable group in such a manner that the degree of importance of the first feature variable and the degree of importance of the third feature variable are lower than the degree of importance of the second feature variable.
11. The data analysis apparatus (300) according to claim 3, wherein the processor (301) executes a conversion process, in a neural network configured with an input layer, an output layer, and two or more intermediate layers which are provided between the input layer and the output layer, in each of which calculation is performed by giving data obtained from a previous layer and a first learning parameter that is the set of weight parameters of the prediction model to an activation function, and from which a calculation result is output to a subsequent layer, for converting the number of dimensions of output data from each of the intermediate layers into a same number of dimensions on a basis of the output data from each of the intermediate layers and a second learning parameter and outputting output data after conversion from each of the intermediate layers, and a reallocation process for reallocating the feature variables in a first feature variable space given to the input layer to a second feature variable space on a basis of the output data after conversion from the conversion process and the feature variables in the first feature variable space, and in the degree-of-importance calculation process, the processor (301) calculates the degree of importance of each feature variable on each of the intermediate layers on a basis of the output data after conversion and a third learning parameter.
12. The data analysis apparatus (300) according to claim 11, wherein in the learning process, the processor (301) adjusts the first learning parameter, the second learning parameter, and the third learning parameter using the prediction result and ground truth data corresponding to the feature variables in the first feature variable space in a case in which training data containing the feature variables in the first feature variable space and the ground truth data corresponding to the feature variables is given.
13. A data analysis method executed by a data analysis apparatus (300) having a processor (301) that executes a program and a storage device (302) that stores the program, the method, executed by the processor (301), comprising: a selection process for selecting a first feature variable group that is a trivial feature variable group contributing to prediction and a second feature variable group other than the first feature variable group from a set of feature variables; an operation process for operating a first regularization coefficient related to a first weight parameter group corresponding to the first feature variable group among a set of weight parameters configuring a prediction model in such a manner that the loss function is larger, and operating a second regularization coefficient related to a second weight parameter group corresponding to the second feature variable group among the set of weight parameters configuring the prediction model in such a manner that the loss function is smaller, in a loss function related to a difference between a prediction result output in a case of inputting the set of feature variables to the prediction model and ground truth data corresponding to the feature variables; and a learning process for learning the set of weight parameters of the prediction model in such a manner that the loss function is minimum as a result of operating the first regularization coefficient and the second regularization coefficient by the operation process.
14. A data analysis program for a processor (301) comprising: a selection process for selecting a first feature variable group that is a trivial feature variable group contributing to prediction and a second feature variable group other than the first feature variable group from a set of feature variables; an operation process for operating a first regularization coefficient related to a first weight parameter group corresponding to the first feature variable group among a set of weight parameters configuring a prediction model in such a manner that the loss function is larger, and operating a second regularization coefficient related to a second weight parameter group corresponding to the second feature variable group among the set of weight parameters configuring the prediction model in such a manner that the loss function is smaller, in a loss function related to a difference between a prediction result output in a case of inputting the set of feature variables to the prediction model and ground truth data corresponding to the feature variables; and a learning process for learning the set of weight parameters of the prediction model in such a manner that the loss function is minimum as a result of operating the first regularization coefficient and the second regularization coefficient by the operation process.